Il semble que tous les graphes que j'ai réalisés ne sont pas affichés sur le rapport html, je vous les ai enregistrés en tant qu'image normalement il vous suffit d'installer le module plotly et de run le notebook. (Qui prend dure minutes à cause des requête de position)

In [1]:
import pandas as pd
In [2]:
ticket_data = pd.read_csv('Data/ticket_data.csv')
cities = pd.read_csv('Data/cities.csv')
stations = pd.read_csv('Data/stations.csv')
providers = pd.read_csv('Data/providers.csv')

Data Exploration

In [3]:
ticket_data
Out[3]:
id company o_station d_station departure_ts arrival_ts price_in_cents search_ts middle_stations other_companies o_city d_city
0 6795025 8385 NaN NaN 2017-10-13 14:00:00+00 2017-10-13 20:10:00+00 4550 2017-10-01 00:13:31.327+00 NaN NaN 611 542
1 6795026 9 63.0 1044.0 2017-10-13 13:05:00+00 2017-10-14 06:55:00+00 1450 2017-10-01 00:13:35.773+00 {149,418} {13} 611 542
2 6795027 8377 5905.0 6495.0 2017-10-13 13:27:00+00 2017-10-14 21:24:00+00 7400 2017-10-01 00:13:40.212+00 {798,798,6794,6246} {8377,8376} 611 542
3 6795028 8377 5905.0 6495.0 2017-10-13 13:27:00+00 2017-10-14 11:02:00+00 13500 2017-10-01 00:13:40.213+00 {798,798,6794,6246} {8377,8376} 611 542
4 6795029 8381 5905.0 6495.0 2017-10-13 21:46:00+00 2017-10-14 19:32:00+00 7710 2017-10-01 00:13:40.213+00 {5983,5983} {8380} 611 542
... ... ... ... ... ... ... ... ... ... ... ... ...
74163 6869182 13 279.0 10729.0 2017-10-27 02:30:00+00 2017-10-27 16:00:00+00 2590 2017-10-19 10:35:42.943+00 {408,408} {8371} 628 562
74164 6869185 9 279.0 304.0 2017-10-27 07:00:00+00 2017-10-27 13:30:00+00 2700 2017-10-19 10:35:42.943+00 {1105,1105} {9} 628 562
74165 6869187 8371 10642.0 304.0 2017-10-27 08:30:00+00 2017-10-27 15:15:00+00 3000 2017-10-19 10:35:42.943+00 {1105,1105} {9} 628 562
74166 6869189 13 279.0 304.0 2017-10-27 13:25:00+00 2017-10-27 21:10:00+00 2880 2017-10-19 10:35:42.943+00 {863,863} {13} 628 562
74167 6869191 8371 10642.0 10729.0 2017-10-27 14:45:00+00 2017-10-27 21:40:00+00 2000 2017-10-19 10:35:42.943+00 {863,863} {8371} 628 562

74168 rows × 12 columns

In [4]:
cities
Out[4]:
id local_name unique_name latitude longitude population
0 5159 Padua, Veneto, Italia padua 45.406435 11.876761 209678.0
1 76 Barcelona, Cataluña, España barcelona 41.385064 2.173404 1611822.0
2 81 Basel, Basel-Stadt, Schweiz basel 47.593437 7.619812 NaN
3 259 Erlangen, Bayern, Deutschland erlangen 49.589674 11.011961 105412.0
4 11979 Balș, Olt, România balș 44.353354 24.095672 NaN
... ... ... ... ... ... ...
8035 9964 Saint-André-lez-Lille, Hauts-de-France, France saint-andre-lez-lille 50.654250 3.047690 NaN
8036 10863 Kiev, Шевченківський район, Ukraine kiev 50.450100 30.523400 NaN
8037 11144 Brighton, England, UK brighton 50.822000 -0.137400 NaN
8038 11377 Manises, Comunitat Valenciana, España manises 39.493000 -0.462100 NaN
8039 11537 South Cambridgeshire, England, UK south-cambridgeshire 52.179600 -0.003400 NaN

8040 rows × 6 columns

In [5]:
stations
Out[5]:
id unique_name latitude longitude
0 1 Aalen (Stuttgarter Straße) 48.835296 10.092956
1 2 Aéroport Bordeaux-Mérignac 44.830226 -0.700883
2 3 Aéroport CDG 49.009900 2.559310
3 4 Aéroport de Berlin-Schönefeld 52.389446 13.520345
4 5 Aéroport de Dresden 51.123604 13.764737
... ... ... ... ...
11030 11032 Porto Avda De Los Aliados 41.146922 -8.611122
11031 11033 Rom Ciampino Flughafen Piazzale Leonardo da Vinci 41.804102 12.597110
11032 11034 Sants 41.379892 2.139462
11033 11035 Vozilici D66 45.157136 14.164140
11034 11036 Wien Hotel ibis Vienna Airport 48.123229 16.437528

11035 rows × 4 columns

In [6]:
providers
Out[6]:
id company_id provider_id name fullname has_wifi has_plug has_adjustable_seats has_bicycle transport_type
0 9 1 NaN ouibus Ouibus True True True False bus
1 10 2 NaN deinbus Deinbus.de False False False False bus
2 11 3 NaN infobus Infobus False False False False bus
3 12 4 NaN studentAgency Student Agency False False False False bus
4 13 5 NaN flixbus Flixbus True False False False bus
... ... ... ... ... ... ... ... ... ... ...
222 8384 10 16 thalys Thalys True True False True train
223 8385 11 NaN bbc Blablacar False False False False carpooling
224 8387 30 NaN vatry Vatry NaN NaN NaN NaN bus
225 8388 40 NaN beauval Beauval NaN NaN NaN NaN bus
226 8389 12 NaN idvroom idVroom False False False False carpooling

227 rows × 10 columns

In [7]:
 
Out[7]:
id company_id provider_id name fullname has_wifi has_plug has_adjustable_seats has_bicycle transport_type
0 9 1 NaN ouibus Ouibus True True True False bus
1 10 2 NaN deinbus Deinbus.de False False False False bus
2 11 3 NaN infobus Infobus False False False False bus
3 12 4 NaN studentAgency Student Agency False False False False bus
4 13 5 NaN flixbus Flixbus True False False False bus
5 14 6 NaN driivme DriiveMe False False False False car
6 15 8 NaN alsa Alsa False False False False bus
7 16 9 NaN distribusion Distribusion False False False False bus
8 17 9 31 expressbusMunchen Expressbus Munchen False False False False bus
9 18 9 33 berlinLinienBus Berlin Linien Bus False False False False bus
In [7]:
import folium
from folium.plugins import MarkerCluster
map = folium.Map(location=[cities.latitude.median(),cities.longitude.median()],zoom_start=5)

cluster = MarkerCluster().add_to(map)
for point in cities.index:
    folium.CircleMarker([cities.latitude[point],cities.longitude[point]], popup=str(cities.id[point])+"\n"+ cities.local_name[point], radius = 4, color='crimson', fill=True, fillColor= 'crimson',).add_to(cluster)
In [8]:
map
Out[8]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Nous pouvons voir par exemple à aix en provence que nous avons 2 id différents pour une même ville. Tentons donc de fusionner 2 positions très proches

Ceci est bien confirmé pour aix en provence nous voyons bien 2 positions sur la gare nous allons tenter de fusionner tous les points

In [9]:
mask = cities.local_name.str.lower().str.contains("aix-en-provence")
cities[mask]
Out[9]:
id local_name unique_name latitude longitude population
27 5444 Aix-en-Provence Centre, Provence-Alpes-Côte d'... aix-en-provence-centre 43.5229 5.4453 NaN
3355 11 Aix-en-provence, Provence-Alpes-Côte d'Azur, F... aix-en-provence 43.5232 5.4451 142149.0
5652 5446 Aix-en-Provence - Gare TGV (à 8km du centre vi... aix-en-provence---gare-tgv- 43.4552 5.3171 NaN

Suppression des villes en doublon

In [10]:
cities["pos"] = cities.latitude.round(3).astype('str')+cities.longitude.round(3).astype('str')
#Fonction permettant de récupérer le nom de ville le plus court pour par exemple avoir aix en provence plutot que aix en provence gare routière
def getSmallestStr(tab):
    name = [name for name in tab]
    length = [len(elem) for elem in tab] 
    return name[length.index(min(length))]

citiescorrected  = cities.groupby("pos",as_index=False).agg({"id":lambda x:list(x),"population":"max","longitude":"mean","latitude":"mean","local_name": lambda x: getSmallestStr(x)}).drop("pos", axis=1)
citiescorrected
Out[10]:
id population longitude latitude local_name
0 [10711] NaN 48.731938 1.370676 , ,
1 [2144] NaN 34.661667 29.035000 Nuweiba, Sud de Sinaï, Egypt
2 [1163, 4911] NaN -90.071532 29.951066 Nola, Campania, Italia
3 [10601] 108469.0 -16.908678 32.649650 Funchal, Madeira, Arquipelago da Madeira (Port...
4 [10635] NaN -16.765907 32.722875 Machico, Madeira, Arquipelago da Madeira (Port...
... ... ... ... ... ...
7030 [994] NaN 10.740665 59.899446 Oslo, Oslo, Norge
7031 [8640] NaN 30.314130 59.938630 Saint-Petersburg, Санкт-Петербург, РФ
7032 [6839] 620715.0 24.941300 60.170900 Helsinki, Etelä-Suomi, Suomi
7033 [9049] 182072.0 22.268690 60.451480 Turku, Etelä-Suomi, Suomi
7034 [2312] NaN -21.942237 64.145981 Reykjavík, Höfuðborgarsvæðið, Ísland

7035 rows × 5 columns

In [11]:
map = folium.Map(location=[citiescorrected.latitude.median(),citiescorrected.longitude.median()],zoom_start=5)

cluster = MarkerCluster().add_to(map)
for point in citiescorrected.index:
    folium.CircleMarker([citiescorrected.latitude[point],citiescorrected.longitude[point]], popup=str(citiescorrected.id[point])+"\n"+ citiescorrected.local_name[point], radius = 4, color='crimson', fill=True, fillColor= 'crimson',).add_to(cluster)
map
Out[11]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Les erreur sont bien corrigées. Travaillons maintenant sur la jointure des différente tables avcec les id de ville corrigés

In [12]:
#On créé et rempli un dictionnaire dans lequel on fait correspondre l'ancien Id de ville avec le nouveau
idConverter = {}
for i in citiescorrected.index:
    idConverter.update({elem:citiescorrected.index[i] for elem in citiescorrected["id"][i]})


#On associe le nouvel id pour chaque station de départ
newColumn = []
for row in ticket_data["o_city"]:
    newColumn.append(idConverter[row])
ticket_data["o_city"] = newColumn


#On associe le nouvel id pour chaque sattion d'arrivée
newColumn = []
for row in ticket_data["d_city"]:
    newColumn.append(idConverter[row])
ticket_data["d_city"] = newColumn
In [13]:
ticket_data
Out[13]:
id company o_station d_station departure_ts arrival_ts price_in_cents search_ts middle_stations other_companies o_city d_city
0 6795025 8385 NaN NaN 2017-10-13 14:00:00+00 2017-10-13 20:10:00+00 4550 2017-10-01 00:13:31.327+00 NaN NaN 3943 1094
1 6795026 9 63.0 1044.0 2017-10-13 13:05:00+00 2017-10-14 06:55:00+00 1450 2017-10-01 00:13:35.773+00 {149,418} {13} 3943 1094
2 6795027 8377 5905.0 6495.0 2017-10-13 13:27:00+00 2017-10-14 21:24:00+00 7400 2017-10-01 00:13:40.212+00 {798,798,6794,6246} {8377,8376} 3943 1094
3 6795028 8377 5905.0 6495.0 2017-10-13 13:27:00+00 2017-10-14 11:02:00+00 13500 2017-10-01 00:13:40.213+00 {798,798,6794,6246} {8377,8376} 3943 1094
4 6795029 8381 5905.0 6495.0 2017-10-13 21:46:00+00 2017-10-14 19:32:00+00 7710 2017-10-01 00:13:40.213+00 {5983,5983} {8380} 3943 1094
... ... ... ... ... ... ... ... ... ... ... ... ...
74163 6869182 13 279.0 10729.0 2017-10-27 02:30:00+00 2017-10-27 16:00:00+00 2590 2017-10-19 10:35:42.943+00 {408,408} {8371} 4688 3423
74164 6869185 9 279.0 304.0 2017-10-27 07:00:00+00 2017-10-27 13:30:00+00 2700 2017-10-19 10:35:42.943+00 {1105,1105} {9} 4688 3423
74165 6869187 8371 10642.0 304.0 2017-10-27 08:30:00+00 2017-10-27 15:15:00+00 3000 2017-10-19 10:35:42.943+00 {1105,1105} {9} 4688 3423
74166 6869189 13 279.0 304.0 2017-10-27 13:25:00+00 2017-10-27 21:10:00+00 2880 2017-10-19 10:35:42.943+00 {863,863} {13} 4688 3423
74167 6869191 8371 10642.0 10729.0 2017-10-27 14:45:00+00 2017-10-27 21:40:00+00 2000 2017-10-19 10:35:42.943+00 {863,863} {8371} 4688 3423

74168 rows × 12 columns

Vérifions maintenant qu'il s'agisse bien de middle stations et pas de middle cities pour faire des requêtes correctes dans la partie suivante. Pour cela nous allons réaliser un set contenant tous les id de middle station et vérifier que ce grand set est bien un superrset de chaque middle station (que chaque élement des middle stations est bien contenu dans le set)

In [14]:
stationsSuperSet = set(list(stations["id"]))
for line in ticket_data["middle_stations"]:
    if isinstance(line, set) and not stationsSuperSet.issuperset(line):
        print("error",line)

Pour pouvoir faire des prédictions de prix au km il va falloir calculer la distance avec une API, le plus simple serait de faire le calcul directement avec la latitude et la longitude, malheureusement nous ne pouvons pas nous déplacer toujours tout droit. On utilisera donc une API de route planning pour avoir un ordre d'idée pour chaque distance, vu que j'utilise une API open source je vais essayer de limiter au maximum le nombre de call ainsi concentrons nous sur la recherche de toutes les combinaisons de départ et d'arrivée présentées dans le DF. Notez qu'on met toujours le plus petit id de ville en premier, on fait encore une approximation et on se dirige donc sur un unoriented graph. Je fais cela surtout pour réduire le nombre de call API le passage en unoriented graph nous divise par presque 2 le nombre de call à l'API. Pour une simple Data analyse je ne pense pas que cela soit problématique.

In [15]:
combinaisons = {}
for lineId in ticket_data.index:
    pos = [ticket_data["o_city"][lineId],ticket_data["d_city"][lineId]]
    mini = min(pos)
    maxi = max(pos)
    #si la ville avec le plus petit ID n'est pas encore dans le dico on l'ajoute
    if mini in combinaisons:
        combinaisons[mini].add(maxi)
    #sinon on ajoute la ville avec le plus grand id dans les villes reliées à cette ville
    else:
        combinaisons.update({mini:set([maxi])})
In [16]:
#maintenant on reformat le dictionnaire de set fait avant afin d'avoir une liste de couple de ville
combinaisonsfinales = []
for start in combinaisons.items():
    combinaisonsfinales = combinaisonsfinales+[[start[0],dest] for dest in start[1]]
print(combinaisonsfinales[:10])
print(len(combinaisonsfinales))
[[1094, 4609], [1094, 1282], [1094, 1796], [1094, 1412], [1094, 4104], [1094, 1810], [1094, 2206], [1094, 2339], [1094, 1316], [1094, 5923]]
1182
In [17]:
#Retourne la longitute et la latitude associée à l'id de de ville donné (bizarrement l'api veut les coordonnées sous la forme long lat et non l'habituel lat long)
def GetCityPos(id):
    city = citiescorrected[id:id+1]
    return [str(city["longitude"][id]),str(city["latitude"][id])]

#On créer un tableau ou l'on stocke non pas une ville de départ et une ville d'arrivée maisdes coordonnées de départ et des coordonnées d'arrivée 
positionsToQuery = [[GetCityPos(trajet[0]),GetCityPos(trajet[1])] for trajet in combinaisonsfinales ] 
print(positionsToQuery[:10])
[[['3.918318', '43.604452'], ['2.3951432999999995', '48.745321399999995']], [['3.918318', '43.604452'], ['-0.50459', '43.885149']], [['3.918318', '43.604452'], ['10.327903599999999', '44.801485']], [['3.918318', '43.604452'], ['3.077801', '44.100575']], [['3.918318', '43.604452'], ['11.5819806', '48.1351253']], [['3.918318', '43.604452'], ['-0.5738451', '44.83357']], [['3.918318', '43.604452'], ['4.402894', '45.44245']], [['3.918318', '43.604452'], ['0.124071', '45.635457']], [['3.918318', '43.604452'], ['2.148', '43.9298']], [['3.918318', '43.604452'], ['4.3326400000000005', '50.834998999999996']]]
In [18]:
import requests, json
distance = []
for city in positionsToQuery:
    posformat = ";".join([",".join(city[0]),",".join(city[1])])
    # on fait notre requête
    r = requests.get('http://router.project-osrm.org/route/v1/driving/{}?overview=false'.format(posformat))

    # On récupère la distance correspondant au trajet
    route = json.loads(r.content)
    distance.append(route["routes"][0]["distance"])

Maintenant nous allons rajouter la distance parcourue pour chaque ticket dans le df d'abord il faut faire un outil pour retrouver rapidement la ville de départ et la ville d'arrivée associées au trajet

In [19]:
previousStartingPoint = None
distanceStorage = {}
for i in range(len(combinaisonsfinales)):
    #si le point de départ n'est pas encore référencé dans le dico on l'ajoute
    if previousStartingPoint != combinaisonsfinales[i][0]:
        previousStartingPoint = combinaisonsfinales[i][0]
        distanceStorage.update({previousStartingPoint:{}})
    #On ajoute la distance entre le point de départ et le point d'arrivée dans le dico (et on la converti en km au passage)
    distanceStorage[previousStartingPoint].update({combinaisonsfinales[i][1]:round(distance[i]/1000,2)})
print(distanceStorage)
{1094: {4609: 746.61, 1282: 422.11, 1796: 704.72, 1412: 115.59, 4104: 1030.9, 1810: 485.0, 2206: 316.05, 2339: 592.95, 1316: 200.11, 5923: 1001.06, 5677: 963.89, 1329: 90.86, 3639: 775.92, 4688: 756.47, 1104: 243.61, 1878: 197.15, 4055: 895.96, 5594: 955.14, 3423: 832.27, 2016: 438.13, 3298: 814.32, 2787: 444.61, 3943: 627.59, 2425: 298.81, 2430: 337.36, 2047: 290.63}, 4688: {5120: 136.82, 5646: 287.15, 5135: 83.96, 5652: 223.58, 6684: 883.49, 5662: 192.77, 5151: 144.34, 5158: 200.66, 6187: 533.42, 6699: 841.68, 6701: 902.49, 6215: 620.64, 6217: 568.41, 5210: 138.2, 5213: 102.83, 5741: 234.91, 5238: 356.9, 5759: 225.44, 5800: 368.32, 5291: 244.59, 5804: 221.28, 5301: 108.52, 4791: 181.63, 6860: 1608.65, 6351: 646.7, 6355: 444.36, 4826: 95.95, 5344: 231.61, 5856: 254.51, 5858: 231.74, 5890: 420.38, 4881: 330.75, 5394: 559.5, 4886: 399.3, 5399: 136.23, 5912: 247.75, 5923: 304.12, 6950: 1357.86, 6442: 544.62, 6449: 1586.0, 6964: 1459.67, 4925: 237.06, 4930: 382.77, 6476: 508.75, 5465: 559.82, 6490: 513.76, 4956: 146.92, 5982: 486.41, 5983: 288.73, 6498: 743.27, 5479: 1538.08, 5992: 291.86, 7019: 2004.73, 6508: 1049.44, 5484: 1029.8, 7030: 1957.02, 5498: 178.18, 5502: 571.27, 6024: 296.04, 5009: 248.9, 6549: 732.65, 5038: 699.95, 6084: 293.01, 6597: 791.58, 5577: 181.07, 6090: 342.92, 6089: 342.07, 5070: 338.39, 6095: 494.52, 5078: 203.25, 5079: 203.25, 5594: 206.3, 5089: 128.09, 6128: 956.93, 5106: 188.47, 5107: 80.8, 5624: 194.56}, 4055: {5120: 315.94, 5923: 648.93, 6659: 992.15, 4901: 253.17, 4484: 831.11, 5800: 713.13, 5577: 525.88, 5084: 445.31, 5804: 566.09, 4925: 188.55, 4141: 246.46, 4688: 346.14, 4920: 601.14, 4445: 103.9, 4093: 41.47, 4319: 242.82}, 4881: {5923: 276.3, 4925: 568.13, 5502: 245.63}, 1810: {5120: 653.93, 4609: 575.09, 2307: 593.18, 2181: 105.16, 4479: 593.08, 2698: 175.33, 3475: 424.8, 5399: 719.54, 2970: 331.91, 5923: 883.23, 5158: 651.36, 2474: 151.05, 2859: 185.04, 5804: 800.39, 2733: 180.33, 4528: 542.41, 3506: 640.26, 1971: 713.37, 4531: 531.07, 4791: 762.88, 2047: 652.12, 3772: 460.39, 4925: 600.97, 2882: 369.07, 3012: 249.96, 6215: 952.88, 3402: 450.51, 4688: 580.43, 1878: 677.76, 3415: 701.24, 4055: 455.88, 4826: 670.39, 2011: 125.54, 3423: 353.17, 2016: 196.96, 2787: 703.7, 4709: 582.45, 3560: 348.78, 4591: 563.07, 5878: 841.12, 2679: 330.81, 5624: 773.66, 2425: 550.52, 2430: 368.95, 5375: 745.95}, 2425: {5120: 593.38, 2561: 139.57, 4484: 487.94, 3076: 294.03, 2824: 80.68, 2700: 47.89, 3857: 299.57, 3986: 390.68, 2584: 223.62, 2459: 24.13, 6684: 1340.06, 4637: 626.43, 2976: 211.8, 3105: 127.14, 5923: 706.45, 4259: 376.46, 3754: 339.77, 3243: 257.15, 5804: 691.2, 2733: 698.95, 3506: 193.23, 3123: 354.54, 4534: 407.44, 3639: 596.3, 2620: 664.25, 4925: 693.62, 4801: 486.29, 6215: 1077.2, 4936: 1277.75, 6476: 919.73, 4688: 461.85, 3415: 249.95, 4055: 716.34, 3549: 425.6, 5982: 723.55, 3423: 677.54, 5858: 701.66, 2787: 149.71, 6498: 1199.84, 3943: 447.97, 3304: 338.77, 5479: 1605.43, 4202: 459.0, 3051: 558.09, 6508: 1230.44, 3827: 377.3, 6387: 878.11, 2430: 165.76}, 3506: {5344: 378.24, 4609: 299.96, 4930: 323.0, 4225: 243.6, 6215: 925.16, 4688: 309.81, 3988: 830.65, 4348: 359.31, 4148: 216.64, 4534: 218.88, 4285: 616.6, 4956: 300.94, 3549: 338.58}, 1347: {4688: 628.69, 5804: 848.65, 1878: 483.46, 3055: 499.82}, 764: {1027: 81.76, 4484: 862.96, 4881: 830.77, 1810: 709.37, 2206: 395.55, 3105: 501.25, 810: 376.99, 3754: 714.78, 5804: 1065.31, 6215: 1451.31, 1225: 189.73, 5577: 1038.71, 4688: 835.96, 1104: 467.97, 855: 64.09, 3415: 624.97, 3423: 1056.64, 3560: 854.48, 1131: 544.34, 2425: 378.31, 1146: 142.8, 2047: 370.12}, 5135: {5577: 157.01}, 1157: {1251: 33.91, 5804: 970.97, 3055: 847.17, 4688: 741.63, 4081: 578.52, 4881: 736.43, 5594: 940.3, 2970: 882.98}, 855: {4609: 761.62, 2561: 422.95, 4484: 798.47, 1412: 278.88, 4881: 766.28, 4758: 746.93, 5399: 910.18, 1048: 171.36, 1307: 98.46, 3995: 891.56, 1436: 163.78, 2206: 331.06, 3105: 436.76, 2339: 752.83, 2217: 514.67, 5804: 1000.82, 1455: 510.77, 1329: 103.01, 3123: 663.81, 3639: 905.57, 2489: 690.64, 5177: 819.48, 1081: 182.0, 2430: 475.03, 964: 140.63, 3012: 777.4, 1094: 163.09, 6215: 1386.82, 975: 25.43, 4688: 771.48, 1104: 403.49, 4051: 726.58, 1110: 333.48, 4055: 1025.61, 5210: 860.87, 6490: 1234.36, 1628: 226.97, 2778: 468.85, 4956: 799.11, 3423: 992.15, 2016: 598.01, 5344: 876.42, 5858: 1011.29, 2787: 459.62, 1251: 121.83, 862: 591.16, 3943: 757.24, 3560: 789.99, 5992: 1071.2, 874: 351.36, 5484: 1385.97, 1390: 111.52, 1774: 192.98, 1520: 346.68, 3827: 687.83, 888: 36.91, 1656: 181.83, 2425: 313.82, 1146: 192.38, 1147: 188.09, 5502: 1000.79}, 4534: {5804: 410.56, 6508: 847.04, 4688: 347.44, 4981: 128.73, 5399: 378.97}, 4259: {4609: 161.61, 4484: 406.04, 4688: 181.45, 4659: 60.64, 5084: 172.93}, 1616: {1810: 623.16, 2047: 149.31, 3943: 600.91}, 2217: {4688: 846.74, 2425: 440.16, 2561: 336.02}, 5399: {5856: 128.14, 5804: 140.91, 6476: 428.38}, 5577: {5856: 116.03, 5992: 112.46, 5737: 34.79, 5741: 99.7, 5652: 94.83, 6422: 351.39}, 862: {3943: 670.04, 4688: 789.6, 1009: 113.43, 5399: 928.7, 2425: 725.11, 2970: 541.08, 2270: 1255.37, 2206: 725.07}, 2979: {4688: 453.09}, 1104: {1546: 148.43, 3857: 610.31, 1810: 244.55, 5009: 855.91, 3988: 816.46, 2711: 422.11, 3488: 548.9, 2339: 352.5, 5923: 978.67, 2854: 600.31, 2217: 876.49, 3754: 797.89, 1455: 110.44, 3506: 647.09, 4531: 769.73, 4534: 861.3, 3639: 636.74, 2489: 290.31, 4925: 839.63, 1214: 590.72, 4801: 700.32, 2372: 613.28, 6215: 1191.54, 6476: 1183.3, 4688: 675.88, 3154: 493.69, 4055: 694.53, 3423: 591.82, 2016: 197.68, 1251: 290.21, 1635: 172.36, 2787: 683.62, 2663: 495.29, 3560: 587.43, 1131: 77.5, 5107: 755.35, 2425: 537.82, 4987: 925.84, 2300: 589.39, 2047: 529.63}, 4531: {4688: 408.24, 4920: 669.44, 4925: 169.67}, 4011: {4688: 354.67}, 2047: {5089: 694.38, 2307: 64.37, 3943: 574.66, 6476: 1029.45, 4688: 571.58, 4050: 770.04, 4055: 818.79, 3827: 462.91, 3639: 698.75, 2425: 105.34, 4826: 653.9, 5399: 710.28, 4445: 940.27, 2206: 152.26}, 2430: {5120: 529.89, 5698: 664.24, 2663: 104.74, 3560: 333.02, 4925: 592.82, 6701: 1280.09, 2733: 512.15, 4688: 421.37, 3506: 331.47, 5107: 500.85, 7002: 2043.41, 3005: 122.35}, 1048: {4801: 925.14, 1251: 251.05, 4484: 820.09, 1094: 292.3, 3943: 886.46, 5804: 1130.04, 4688: 900.7, 3506: 632.08, 3415: 689.7, 2425: 443.04, 2047: 434.86}, 3423: {5120: 379.4, 4609: 368.47, 4483: 293.0, 4484: 858.79, 3846: 165.21, 3986: 411.94, 5399: 499.18, 3995: 177.17, 5923: 676.62, 4259: 515.35, 5804: 593.78, 4657: 362.83, 6196: 716.23, 4534: 750.58, 3639: 83.3, 3772: 114.05, 3525: 243.72, 6476: 881.25, 4688: 373.82, 4050: 151.25, 4055: 104.14, 5084: 472.99, 4706: 406.16, 3560: 208.3, 6508: 1421.94, 4717: 409.09, 4336: 222.75, 5106: 357.99, 6387: 839.63, 5878: 608.16, 4348: 290.1, 4093: 142.86, 5759: 597.93}, 79: {4688: 1537.61, 2425: 1144.46, 598: 696.61}, 1000: {2016: 388.23, 5677: 975.56, 4688: 772.32, 1104: 306.4, 1810: 190.87, 2425: 741.79}, 1027: {2561: 407.66, 2787: 444.34, 1251: 106.55, 1146: 168.19, 5804: 985.54, 2733: 804.02, 5070: 777.88, 4688: 756.19, 4881: 751.0, 1810: 629.59, 2003: 598.39, 3639: 890.28, 1048: 147.17, 2425: 298.54, 2778: 453.56, 5399: 894.89, 2430: 459.74}, 813: {855: 276.0}, 3147: {3296: 741.05, 3772: 163.28}, 4051: {5120: 678.94, 4484: 72.5, 6215: 834.46, 4591: 560.0, 4688: 533.47}, 3296: {4688: 360.64, 4484: 300.73, 3827: 190.08, 6476: 762.07}, 754: {1104: 71.6}, 5741: {5804: 182.28, 6215: 235.0}, 4479: {4688: 482.04}, 3304: {5923: 547.36, 6215: 801.47, 6476: 751.98, 4688: 244.56, 3506: 244.04}, 4525: {4688: 246.69, 5804: 309.81, 5624: 283.09}, 796: {3560: 695.75, 4688: 784.11, 1104: 151.37, 2898: 568.64, 1110: 122.47, 1436: 183.03, 862: 339.04}, 598: {5120: 952.74, 3105: 574.01, 5858: 1074.64, 2787: 596.87, 5923: 1147.01, 1027: 301.9, 966: 297.81, 3943: 723.1, 2425: 451.07, 5804: 1064.17, 3055: 737.73, 4688: 844.22, 1810: 447.52, 4055: 897.5, 1878: 349.41, 855: 316.19, 1177: 475.14, 1146: 467.44}, 4484: {5120: 635.79, 4609: 493.73, 5698: 448.81, 5923: 435.72, 6116: 711.41, 4805: 82.81, 4956: 351.72, 5804: 553.43, 6701: 703.13, 5070: 188.88, 5646: 415.37, 4688: 490.32, 4881: 164.79, 6449: 1293.32, 5399: 521.84, 6041: 692.79, 5084: 403.88, 5502: 216.58}, 4445: {5923: 701.8, 4742: 455.79}, 148: {1027: 834.56, 1251: 736.12, 4688: 1376.87, 1329: 775.77, 1810: 782.46, 2425: 983.73}, 878: {4688: 755.15}, 756: {6215: 1469.91, 3247: 664.78, 4688: 854.56, 855: 82.69, 856: 67.19}, 1146: {1412: 430.36, 1157: 243.38, 2824: 542.68, 6666: 1764.85, 5135: 1009.38, 3986: 851.78, 3988: 1368.26, 3094: 613.93, 5923: 1167.55, 5800: 1142.03, 2217: 325.77, 3506: 654.33, 4534: 868.53, 1207: 42.58, 3772: 1250.85, 2882: 723.39, 3012: 928.87, 4688: 922.95, 4051: 724.26, 6355: 1316.43, 3029: 595.55, 3415: 711.95, 4055: 1177.08, 6366: 1252.78, 3423: 1143.62, 1251: 273.31, 5094: 1002.36, 3560: 941.47, 5624: 1125.57, 2425: 465.29, 4093: 1161.24, 2047: 457.11}, 2381: {3506: 213.25}, 5804: {5857: 47.1, 5923: 108.41, 5925: 117.02, 6215: 283.68, 6090: 124.15, 6476: 289.98, 7030: 1729.55, 6422: 355.12}, 2561: {3105: 205.16, 2787: 42.43, 5804: 752.37, 4688: 539.87, 2584: 95.22, 3423: 813.87}, 871: {1000: 50.26, 3012: 487.12}, 951: {4688: 873.24, 1027: 119.04, 2430: 576.79}, 964: {4688: 870.07, 1878: 310.75, 2425: 412.42, 1146: 57.7, 4956: 897.71}, 5107: {6024: 224.84, 5923: 232.92, 5135: 61.42}, 892: {5120: 863.44, 4484: 851.77, 1094: 73.02, 1104: 180.18, 2321: 350.42, 1329: 159.17, 3988: 993.48, 2300: 418.69, 3385: 770.19, 3772: 876.07}, 3560: {5698: 478.88, 5923: 538.8, 4484: 695.3, 6950: 1592.54, 3943: 116.44, 5800: 603.0, 5709: 546.03, 4688: 236.0, 4659: 395.29, 3639: 127.76}, 1402: {2016: 132.16, 4688: 610.35}, 371: {5923: 1336.44, 1616: 484.58, 4688: 1033.64, 1104: 395.55, 4055: 1086.93, 3217: 942.37, 4534: 1043.73, 855: 505.62, 598: 192.65, 2425: 640.49, 2206: 657.73, 4925: 1205.09, 5502: 1327.46}, 1957: {4104: 609.06, 2425: 305.33, 3943: 774.65, 2047: 236.07}, 5858: {5923: 110.0}, 2733: {4587: 411.34, 5804: 688.25, 4688: 468.3, 4055: 245.98, 3415: 749.13}, 2016: {2561: 496.45, 3298: 341.38, 2787: 513.95, 3943: 358.18, 4688: 479.29, 3857: 413.73, 4055: 513.55, 2425: 360.77, 2047: 462.37, 2430: 179.2, 3423: 410.84}, 2290: {4688: 518.34}, 4050: {4688: 276.43, 4089: 80.76, 4055: 78.58}, 110: {2016: 1345.25, 5923: 2032.14, 3943: 1609.78, 5800: 2096.34, 4688: 1729.34, 341: 310.7, 2425: 1698.81, 796: 1402.9}, 939: {1329: 143.05, 1146: 115.21}, 966: {3105: 416.92, 1455: 490.93, 1616: 138.07, 4688: 751.63, 1878: 192.31, 2489: 670.8}, 810: {4688: 765.92, 1104: 93.2, 5252: 1020.74}, 3415: {4688: 406.71, 4484: 248.5, 3506: 94.81, 6476: 737.21}, 3012: {4706: 367.82, 5923: 638.28, 4484: 794.78, 4688: 335.48, 3154: 79.99}, 2916: {4688: 382.5}, 2859: {3713: 238.74, 6476: 916.6, 3055: 92.93, 4688: 409.18, 3506: 593.07, 3772: 257.22}, 1329: {4609: 677.13, 5923: 931.58, 3639: 821.08, 1957: 459.0, 5800: 906.06, 5804: 916.33, 4688: 686.99, 1616: 73.42, 4055: 941.11, 3415: 475.99, 2425: 229.33, 2047: 221.14, 2206: 246.57, 4319: 1201.52}, 430: {1146: 595.78}, 2205: {4688: 1107.94, 2217: 265.38, 5923: 1141.57, 5484: 815.68}, 1628: {4688: 715.22}, 2206: {5120: 656.69, 5344: 624.0, 3012: 447.22, 5804: 748.4, 2383: 152.27, 4688: 519.06, 2425: 61.4}, 458: {3652: 1216.46, 1189: 680.74, 505: 194.99, 4688: 1417.98, 855: 893.78, 2425: 988.76, 1179: 365.18}, 4826: {4956: 62.65}, 1042: {2047: 289.37}, 3632: {3857: 43.29}, 4609: {5120: 151.31, 6090: 355.19, 5804: 233.55, 6476: 521.02}, 2611: {4609: 453.47, 3506: 190.75}, 1251: {5698: 946.21, 3943: 676.73, 5804: 938.99, 3983: 685.45, 4688: 709.64, 1810: 530.72, 4055: 945.1, 2425: 251.98, 3772: 985.22, 2047: 243.8}, 2824: {3105: 92.12}, 2665: {4688: 1248.0}, 3139: {4688: 300.39}, 5120: {5858: 265.05, 5502: 716.25, 5741: 159.87, 5646: 320.45, 5135: 86.2, 6366: 659.31}, 2787: {4319: 1084.31, 6476: 995.71, 4688: 537.41, 3857: 375.13, 3415: 174.85, 5502: 574.61, 3423: 830.01}, 1938: {5804: 758.05, 2206: 75.77, 2383: 224.77}, 1878: {5120: 693.76, 5804: 791.58, 5677: 769.65, 4525: 468.76, 4688: 562.23, 4534: 507.81, 5662: 763.07, 4319: 1076.76}, 3096: {4688: 1949.95, 6508: 1335.47}, 3038: {4688: 373.74}, 2011: {3298: 422.21, 2339: 95.46, 5804: 765.03, 3506: 516.29, 5238: 840.62, 2425: 426.55}, 1557: {5923: 874.16}, 5923: {6215: 371.52, 5992: 204.03, 6089: 46.42, 6217: 270.84, 6476: 218.42, 6196: 611.36}, 3943: {4484: 587.79, 4688: 128.5, 4148: 141.04, 5624: 321.73, 3998: 503.87}, 941: {4609: 748.17, 964: 296.57, 1000: 529.03, 5677: 956.75, 3870: 656.38}, 3639: {4801: 316.49, 4609: 286.7, 5804: 512.0, 4688: 292.05, 4881: 616.8, 4050: 78.71, 6196: 745.23, 4055: 122.01, 3998: 722.59}, 231: {1094: 933.53, 6090: 1607.85, 3243: 1213.28, 4688: 1266.25, 1104: 798.19, 371: 617.38, 855: 1092.76, 2425: 1227.63, 2430: 1054.14}, 2854: {4688: 393.62, 2859: 588.36, 5923: 638.22}, 736: {1094: 342.26}, 2914: {5759: 596.18}, 3827: {4688: 533.81, 4202: 86.66, 4637: 254.1}, 1763: {4688: 588.39}, 2332: {4688: 669.58}, 2882: {4688: 329.4, 4484: 581.69}, 318: {4688: 1133.07, 5594: 1338.05}, 1081: {2425: 454.38, 1810: 785.43}, 2331: {4688: 505.16, 2339: 108.28, 4639: 457.16}, 2489: {4609: 384.56, 5804: 609.86, 4688: 389.9, 5594: 594.88, 3995: 322.62, 4956: 530.83}, 923: {4688: 785.45}, 837: {855: 841.15}, 2175: {2383: 163.83}, 4483: {4688: 533.19, 5158: 411.87, 4551: 216.68}, 2383: {2425: 105.94}, 2300: {4688: 566.98, 2733: 796.8, 2430: 263.61}, 3105: {5800: 555.58, 5923: 581.09, 4956: 364.13, 4688: 336.5}, 4801: {6024: 271.92}, 2591: {4688: 587.53}, 3713: {4688: 182.82, 6476: 690.25}, 5626: {5899: 274.37}, 1009: {4688: 765.84, 1810: 184.38, 3988: 763.21, 2711: 600.07}, 3659: {4688: 478.81}, 2711: {5120: 515.11, 3005: 55.56, 3423: 515.05}, 608: {4688: 2189.57}, 466: {1810: 538.52, 598: 94.22}, 58: {4688: 1590.01, 1810: 977.72, 5923: 1892.81}, 323: {466: 189.54, 3423: 966.38}, 266: {4688: 1838.67}, 5646: {5923: 58.81, 5804: 122.2, 5741: 259.21, 6476: 272.09}, 1635: {2787: 678.92, 3285: 738.59}, 2976: {4688: 513.83, 4657: 517.67, 5923: 701.98, 6508: 1062.26}, 1597: {4688: 573.25, 3423: 504.8}, 2686: {5804: 780.74}, 2909: {4688: 466.06}, 2280: {4688: 446.38}, 1177: {4688: 727.66, 5120: 801.15, 1810: 146.2, 4445: 705.22}, 874: {5238: 999.29}, 564: {5120: 977.71, 796: 89.82, 3534: 783.54, 598: 23.43}, 3857: {4688: 166.43, 4981: 392.55, 3910: 36.13}, 3986: {4484: 513.28}, 35: {5772: 2072.97}, 4587: {4688: 510.23, 4706: 542.57}, 3995: {5120: 211.21, 5923: 508.47, 5577: 385.41, 4688: 205.67, 4956: 346.6}, 2372: {5856: 874.69}, 1075: {1094: 114.57}, 3465: {4688: 433.59, 4484: 918.56, 3943: 387.11}, 1871: {4348: 503.38, 2430: 156.09, 4956: 695.34}, 5210: {6476: 435.04}, 3374: {4483: 304.64}, 4358: {4688: 398.14}, 3772: {5923: 762.95, 4011: 106.61, 4688: 456.2, 5107: 535.67, 4925: 291.44, 4319: 186.96}, 4148: {4688: 123.95, 4259: 77.7}, 3759: {4319: 243.32}, 4791: {5120: 327.79, 5089: 310.3, 4956: 43.73}, 1071: {4688: 902.83, 5804: 1132.18}, 5800: {6508: 685.33, 5804: 203.27, 6215: 464.6}, 3783: {4688: 476.23, 4531: 169.85, 4956: 617.16}, 2702: {4688: 440.05}, 2620: {4688: 474.44, 3066: 156.5, 3560: 242.79, 3055: 113.74}, 2391: {6135: 986.19}, 4659: {5804: 276.87}, 2253: {3560: 508.74}, 2339: {4688: 448.62, 2425: 517.53, 3772: 380.75}, 3054: {4688: 411.76, 5804: 584.66, 4484: 351.85}, 1001: {1009: 76.77, 1146: 778.86}, 815: {5120: 898.17, 5677: 1027.91, 4688: 824.68, 4881: 1059.68, 1104: 156.35, 2425: 684.32, 1146: 700.69}, 3247: {5502: 449.25}, 4956: {5120: 292.3, 5923: 231.98, 5476: 263.69, 5107: 99.2}, 4780: {5411: 181.03}, 856: {1146: 180.28}, 4925: {5923: 470.29, 5804: 387.45, 6215: 385.15}, 975: {4688: 764.92}, 4636: {4920: 53.44, 4688: 251.29}, 1455: {4688: 709.87, 3506: 674.39}, 5252: {5741: 415.88}, 1152: {2425: 279.25}, 3988: {4688: 559.07, 6659: 1197.91, 4484: 1044.04}, 4319: {5800: 905.19, 4609: 580.42, 4688: 585.77, 6476: 1045.62}, 3846: {5120: 455.78, 4688: 493.15, 4445: 122.03}, 6476: {6508: 665.47}, 872: {6476: 1256.12, 3639: 775.91}, 1897: {4688: 556.02}, 1014: {2665: 469.83}, 5084: {5804: 169.02}, 3123: {5120: 375.5, 5923: 569.78, 3304: 65.78, 5135: 353.82, 4688: 266.98}, 2099: {4484: 714.39, 4981: 744.61}, 741: {4688: 824.44, 1000: 157.99, 4055: 699.89}, 3754: {5890: 563.01, 4484: 152.43, 4202: 123.49}, 4591: {5577: 199.25, 5804: 239.46}, 1716: {3772: 513.11}, 4093: {4688: 309.21}, 4202: {6051: 559.56}, 1563: {3943: 559.94}, 1436: {2425: 246.94}, 2341: {2561: 114.69}, 5158: {5923: 401.36, 6215: 476.45}, 3387: {3465: 58.04}, 4135: {4688: 391.24, 4484: 129.01}, 4173: {4688: 234.73}, 2241: {2817: 193.23}, 3652: {4688: 1485.36}, 457: {497: 51.99}, 4502: {4688: 361.24}, 3217: {4688: 564.18, 6476: 844.72}, 3005: {4688: 298.25}, 2584: {3154: 813.4, 3012: 694.88}, 4819: {4864: 49.61}, 2138: {4706: 536.98}, 902: {4688: 798.75, 1810: 217.29}, 4488: {6128: 596.59}, 4920: {6476: 456.03}, 4930: {4956: 244.35}, 3298: {4688: 350.9, 5120: 356.48, 4864: 286.3}, 2106: {4688: 532.65}, 1800: {3772: 465.7}, 4104: {4688: 827.69}, 3475: {4817: 253.71}, 2514: {4688: 599.31}, 30: {4688: 1861.71}, 5394: {6366: 303.2}, 2565: {3639: 638.56}, 5787: {6476: 296.65}, 5484: {5800: 809.74}, 2778: {4688: 540.61, 3639: 751.97}, 4348: {4688: 88.45}, 4706: {5923: 313.94, 5709: 321.17, 5646: 296.97}, 6355: {6476: 71.49}, 2307: {5923: 757.0}, 2414: {5120: 544.71}, 13: {4688: 1793.53, 1329: 1377.16, 371: 955.53, 855: 1450.24}, 4950: {6476: 439.71}, 3738: {4688: 366.67}, 987: {1800: 260.13, 1810: 262.09}, 3187: {4259: 323.29}, 5624: {5804: 40.99}, 6215: {6669: 338.65}, 2163: {4688: 552.13, 2425: 85.89}, 2057: {4591: 608.43}, 5220: {5878: 283.95}, 1579: {2217: 137.12}, 3866: {5923: 931.77}, 5992: {6684: 503.71}, 1585: {4688: 2314.92}, 1629: {4688: 1060.06, 5923: 1096.3}, 894: {2733: 399.15}, 1282: {2425: 680.06, 5804: 930.55}, 3693: {4051: 64.29}, 523: {4688: 2017.65}, 3549: {5800: 567.49, 4484: 224.79}, 4489: {4688: 512.07}, 1796: {2425: 543.12}, 1723: {2430: 754.3}, 5156: {5923: 428.59}, 341: {764: 1541.2}, 1412: {1546: 64.69, 1810: 415.68}, 4810: {4925: 1080.29}, 6381: {6701: 248.19}, 3488: {4688: 309.75}, 5344: {5923: 148.65}, 5737: {5741: 121.93}, 1564: {4688: 655.57}, 4160: {4688: 1235.77}, 2663: {3846: 715.79}, 4115: {5804: 1300.6}, 1546: {1810: 385.49, 5301: 762.4}, 762: {5677: 997.63}, 4336: {4925: 106.29}, 2970: {4956: 594.63, 3639: 169.29}, 2460: {4484: 429.57}, 4666: {5120: 152.29}, 958: {5582: 1043.77}, 1845: {4688: 579.78}, 4247: {4688: 1074.06}, 2364: {4688: 1122.91}, 1110: {1810: 313.43}, 3402: {4688: 209.02}, 1676: {1810: 47.31}, 1309: {2620: 735.06}, 1718: {4688: 627.4}, 3848: {5120: 854.45}, 1128: {2425: 265.8}, 3329: {3560: 39.25}, 1214: {4688: 958.69}, 1131: {4484: 977.45}, 151: {4688: 1672.71}, 2668: {3639: 784.22}, 5911: {6215: 85.54}, 2511: {4688: 449.88}, 165: {3943: 1393.24}, 5238: {5800: 164.46}, 1211: {4688: 1147.47}, 3243: {4688: 239.1}, 1179: {2217: 278.44}}

Maintenant il suffit juste de remplir le df

In [20]:
ticketdistances = []
for line in ticket_data.index:
    cities = [ticket_data["o_city"][line],ticket_data["d_city"][line]]
    tdist = distanceStorage[min(cities)][max(cities)]
    ticketdistances.append(tdist)

On rajoute une catégorie de distance à chaque ticket (court, moyen, long, très long)

In [21]:
ticket_data["distance"]=ticketdistances
ticket_data["distanceType"] = ["court" if distance < 200 else "moyen" if distance < 800 else "long"if distance < 2000 else "très long" for distance in ticketdistances]
ticket_data["distanceType"].value_counts()
Out[21]:
moyen        56939
court        11311
long          5862
très long       56
Name: distanceType, dtype: int64

On crée un df afin d'avoir seulement un id et un mode de transport pour chaque provider

In [22]:
types = providers[["id","transport_type"]]
types.head(10)
Out[22]:
id transport_type
0 9 bus
1 10 bus
2 11 bus
3 12 bus
4 13 bus
5 14 car
6 15 bus
7 16 bus
8 17 bus
9 18 bus
In [23]:
types.transport_type.value_counts()
Out[23]:
bus           214
train          10
carpooling      2
car             1
Name: transport_type, dtype: int64

On joint notre mode de transport à la table ticket via le provider

In [24]:
finalTickets = pd.merge(ticket_data,types,left_on="company" ,right_on="id")
finalTickets.head(10)
print("row delta",len(ticket_data),len(finalTickets))
row delta 74168 74168
In [25]:
finalTickets.transport_type.value_counts()
Out[25]:
carpooling    41441
train         18929
bus           13798
Name: transport_type, dtype: int64
In [26]:
finalTickets["price"] = finalTickets.price_in_cents/100
In [27]:
import plotly.express as pex
fig = pex.scatter(finalTickets,x="price",y="distance",facet_col="transport_type",facet_row="distanceType", title = "Facet plot pour observer le lien entre distance et prix")
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()

Nous pouvons observer un relation linéaire modérée entre la distance et le prix d'un billet

In [28]:
import datetime
#On converti les timestamp en datetime utilisables
finalTickets['arrival_ts']= pd.to_datetime(finalTickets['arrival_ts'])
finalTickets['departure_ts']= pd.to_datetime(finalTickets['departure_ts'])
#On calcule la durée du trajet
finalTickets["time delta"] = finalTickets.arrival_ts - finalTickets.departure_ts
print(finalTickets["time delta"].describe())



#On stocke seulement la durée du trajet en heures
finalTickets["time delta"]=pd.to_timedelta(finalTickets["time delta"])
finalTickets["time delta"] = finalTickets["time delta"].dt.days*24 + finalTickets["time delta"].dt.seconds/3600



#On crée le df dans lequel on group by par distance et mode de transport pour faire notre comparaison de vitesse et de prix
averagePriceDf = finalTickets.groupby(["transport_type","distanceType"]).agg({"time delta":"mean","price":"mean"})
averagePriceDf.reset_index(inplace=True)
averagePriceDf["ordering"]=[["court","moyen","long","très long"].index(elem) for elem in averagePriceDf["distanceType"]]
averagePriceDf.sort_values("ordering",inplace=True)
count                     74168
mean     0 days 07:04:37.247600
std      0 days 09:54:58.881351
min             0 days 00:20:00
25%             0 days 03:00:00
50%             0 days 04:50:00
75%             0 days 08:00:00
max            20 days 12:51:00
Name: time delta, dtype: object
In [29]:
pex.bar(averagePriceDf,x="distanceType", y="time delta",color="transport_type", barmode = "group", title="Durée du voyage en fonction du mode de transport et de la catégorie de distance")
In [30]:
pex.bar(averagePriceDf,x="distanceType", y="price",color="transport_type", barmode = "group", title = 'Durée moyenne du voyage en fonction du mode de transport et de la catégorie de distance')

Modèle de pricing

Approche Naive

Comparons d'abord les prix par mois, tranche de kilomètres afin d'avoir un prix moyen, ensuite nous pourrons rajouter des extras en fonction de ce prix de base:

  • un bonus/malus de réservation en avance
  • un bonus/malus en fonction du jour de la semaine (tarifs plus élevée le vendredi le week end et le lundi)
  • un bonus/malus avec l'inflationa annuelle
  • un bonus/malus en fonction de la distance dans la tranche
In [49]:
finalTickets["month"] = finalTickets["departure_ts"].dt.month
finalTickets["year"] = finalTickets["departure_ts"].dt.year
finalTickets["year"] = finalTickets["year"].astype(str)
In [50]:
monthlyPrice = finalTickets.groupby(["month","transport_type","distanceType"]).agg({"price":"mean"}).reset_index()
yearlyPrice = finalTickets.groupby(["year","transport_type","distanceType"]).agg({"price":"mean"}).reset_index()
In [51]:
pex.bar(monthlyPrice,x="month",y="price",color="distanceType",facet_col="transport_type",barmode="group",facet_col_wrap=2)
In [52]:
pex.bar(yearlyPrice,x="distanceType",y="price",color="year",facet_col="transport_type",barmode="group",facet_col_wrap=2)
In [35]:
days = ["Lundi","Mardi","Mercredi","Jeudi","Vendredi","Samedi","Dimanche"]
finalTickets["day"] = [days[elem] for elem in finalTickets["departure_ts"].dt.weekday]
daily = finalTickets.groupby(["day","transport_type"]).agg({"price":"mean"}).reset_index()
daily["ordering"]=[["Lundi","Mardi","Mercredi","Jeudi","Vendredi","Samedi","Dimanche"].index(elem) for elem in daily["day"]]
daily.sort_values("ordering", inplace=True)
pex.bar(daily,x="day",y="price",facet_col="transport_type",barmode="group",facet_col_wrap=2)
In [36]:
finalTickets['search_ts']= pd.to_datetime(finalTickets['search_ts'])
serie = finalTickets.departure_ts - finalTickets.search_ts
finalTickets["deltaresa"] = serie.dt.days
finalTickets["deltaresa"].describe()
Out[36]:
count    74168.000000
mean         6.722036
std         10.582113
min        -24.000000
25%          1.000000
50%          4.000000
75%          8.000000
max        176.000000
Name: deltaresa, dtype: float64
In [56]:
deltaresa = finalTickets.groupby(["deltaresa","transport_type"]).agg({"price":"mean"}).reset_index()
earlyBirds = deltaresa[deltaresa["deltaresa"]>88].groupby("transport_type").agg({"deltaresa":"min","price":"mean"})
mask = (deltaresa["deltaresa"]<=88) & (deltaresa["deltaresa"]>0)
deltaresa = deltaresa[mask]
deltaresa.append(earlyBirds)

fig = pex.bar(deltaresa,x="deltaresa",y="price",facet_col="transport_type",facet_col_wrap=2)
fig.show()

Malhereusement nous manquons d'amplitude sur la date des différents billets le modèle qui en découlera ne sera pas bon, du moins avec une approche naive nous allons donc tenter avec une random forest

Petit test en cross validation en random forest sans parameter tuning

Regardons d'abord les données que nous avons utiliser pour train notre model

In [38]:
finalTickets.columns
Out[38]:
Index(['id_x', 'company', 'o_station', 'd_station', 'departure_ts',
       'arrival_ts', 'price_in_cents', 'search_ts', 'middle_stations',
       'other_companies', 'o_city', 'd_city', 'distance', 'distanceType',
       'id_y', 'transport_type', 'price', 'time delta', 'month', 'year', 'day',
       'deltaresa'],
      dtype='object')
In [39]:
def GetCityPos(id):
    city = citiescorrected[id:id+1]
    return [city["longitude"][id],city["latitude"][id]]
pricingData = finalTickets.copy()
pricingData["long1"] = [GetCityPos(id)[0] for id in pricingData["o_city"]]
pricingData["lat1"] = [GetCityPos(id)[1] for id in pricingData["o_city"]]
pricingData["long2"] = [GetCityPos(id)[0] for id in pricingData["d_city"]]
pricingData["lat2"] = [GetCityPos(id)[1] for id in pricingData["d_city"]]
In [40]:
pricingData.drop(["id_x","o_station","d_station","price_in_cents","search_ts","middle_stations","other_companies","o_city","d_city","distanceType","id_y","departure_ts","arrival_ts"], axis=1,inplace=True)
pricingData["day"] = finalTickets["departure_ts"].dt.weekday
pricingData["transport_type"] = [["carpooling","bus","train"].index(elem) for elem in pricingData["transport_type"]]
pricingData.dtypes
Out[40]:
company             int64
distance          float64
transport_type      int64
price             float64
time delta        float64
month               int64
year                int64
day                 int64
deltaresa           int64
long1             float64
lat1              float64
long2             float64
lat2              float64
dtype: object
In [43]:
pricingData.dropna(how='any')
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor
y = pricingData['price']
X = pricingData.drop("price", axis=1)
model = RandomForestRegressor(n_estimators=50)
cv_results = cross_validate(model, X, y, cv=6, return_train_score=True,verbose=2, n_jobs = -1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   6 | elapsed:   11.3s remaining:   11.3s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   13.0s finished
In [44]:
cv_results
Out[44]:
{'fit_time': array([10.90669084, 10.93071985, 11.2762444 , 10.64423251, 10.47023344,
        10.52122903]),
 'score_time': array([0.07799911, 0.07596874, 0.10501456, 0.10899997, 0.09900141,
        0.08299923]),
 'test_score': array([0.93660632, 0.93521777, 0.92227197, 0.73022011, 0.71835841,
        0.5121966 ]),
 'train_score': array([0.98845065, 0.98835999, 0.98903987, 0.98913963, 0.98982655,
        0.98578288])}

Le modèle présente de l'overfit sur les dernier fit, cela doit être lié au fait que je n'ai ni stratifié ni mélangé le training set

In [47]:
model.fit(X,y)
model.predict(X[:10])
Out[47]:
array([44.79942857, 17.36750007, 16.54237484, 16.54237484, 16.88516194,
       16.88516194, 17.36750007, 17.16374764, 16.54237484, 16.54237484])
In [53]:
y[:10]
Out[53]:
0    45.5
1    18.0
2    21.5
3    17.0
4    17.0
5    19.0
6    15.5
7    14.5
8    15.5
9    15.5
Name: price, dtype: float64

Les résultats sont assez proches et assez bon à première vue, avec du parameters tuning nous pourrions avoir d'excellents résultats

In [54]:
model.predict(X[-10:])
Out[54]:
array([ 41.005 ,  44.1   ,  65.93  ,  26.394 , 149.006 ,  51.2622,
        51.735 ,  20.376 ,  50.128 ,  34.9116])
In [55]:
y[-10:]
Out[55]:
74158     38.29
74159     41.30
74160     71.40
74161     24.50
74162    163.95
74163     46.87
74164     47.77
74165     19.90
74166     45.90
74167     33.00
Name: price, dtype: float64

Conclusion

Ma courte démonstration de mon utilisation de pandas d'achève ici. Comme vous l'avez dit le travail est infini et j'ai donc l'impression de rendre quelque chose d'innachevé. J'espère avoir fait au moins ce que vous attendez, j'ai essayé de faire quelques bonus que vous avez conseillé. Je vous remercie pour ce travail qui a été très intéressant sur une data et une problématique intéressante.